To function effectively, brains need to make predictions about their environment based 
on past experience, i.e., they need to learn about their environment. The algorithms by 
which learning occurs are of interest to neuroscientists, both in their own right (because 
they exist in the brain) and as a tool to model participants' incomplete knowledge of 
task parameters and hence, to better understand their behavior. This review focusses 
on a particular challenge for learning algorithms — how to match the rate at which they 
learn to the rate of change in the environment, so that they use as much observed data 
as possible whilst disregarding irrelevant, old observations. To do this algorithms must 
evaluate whether the environment is changing. We discuss the concepts of likelihood, 
priors and transition functions, and how these relate to change detection. We review 
expected and estimation uncertainty, and how these relate to change detection and 
learning rate. Finally, we consider the neural correlates of uncertainty and learning. We 
argue that the neural correlates of uncertainty bear a resemblance to neural systems 
that are active when agents actively explore their environments, suggesting that the 
mechanisms by which the rate of learning is set may be subject to top down control (in 
circumstances when agents actively seek new information) as well as bottom up control 
(by observations that imply change in the environment). 
To function efficiently in their environment, agents (humans and 
animals) need to make predictions. We can think of predictions 
being based on an internal model of the environment, stored in 
the brain, which represents information that has been observed, 
and predicts what will happen in future. The process by which 
such a model is constructed and updated may be called a learning 
algorithm. Learning algorithms are of interest to neuroscien- 
tists, partly because such algorithms actually exist in the brain 
(and we would like to understand them) and partly because con- 
structing learning algorithms that model participants' incomplete 
knowledge of task contingencies can help us to understand their 
behavior in experimental paradigms. 

Whilst all knowledge of the environment is arguably acquired 
through learning, learning is particularly important in environ- 
ments that change over time. In this review we are concerned 
with a particular computational problem that arises in com- 
plex changing environments — how should learning algorithms 
adapt their learning rate to match the rate of change of the envi- 
ronment. We will consider two key concepts in inferring the 
rate of change: the likelihood function, by which the likelihood 
that current and past observations were drawn from the same 
distribution is evaluated, and the prior probability of change, 
which constrains how much evidence will be required for the 
learning algorithm to infer that a change has in fact occurred. 
We will relate these two constructs to the concepts of expected 
and estimation uncertainty, and consider the interplay between 
uncertainty and learning. Finally we will consider neural
correlates of uncertainty and learning, and ask whether these are
the same when learning is driven bottom up by surprising
observations, and top down as part of the process of actively
exploring the environment.

WHY IS CHANGE A CHALLENGE FOR LEARNING 
ALGORITHMS? 

A learning algorithm is an algorithm that makes use of past
experience to construct a representation of the learned-about subject
(we will call the learned-about subject "the environment" in this
article). The purpose of learning is to predict future observations
of the environment and hence respond to them efficiently (Friston
and Kiebel, 2009; Friston, 2010). Therefore, to function effectively
it is essential that the representation developed by the learning
algorithm accurately reflects the current state of the environment
and/or is predictive of future environmental states.

Throughout this review, when I mention a changing environ- 
ment, I mean an environment that changes to an unknown state. 
Environments can change in both predictable and unpredictable 
ways. A predictably changing environment would be a changing 
environment whose state can nevertheless be predicted precisely 
as a function of time — for example, the phases of the moon. An 
unpredictably changing environment could be defined as an envi- 
ronment that undergoes changes that move it to an unknown 
state. For example, the location of the TV remote control in a fam- 
ily living room often behaves like this. In terms of this discussion 
of learning algorithms, we are only really interested in the second 
type of change — in the first case (an environment which changes, 
but predictably) there is nothing new to learn. 
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THE KEY CHALLENGE: HOW FAR BACK SHOULD YOU LOOK? 

Given that the changing environment is not totally random over 
time (in which case learning would be useless), a learning algo- 
rithm can make use of a history of data extending beyond the 
most recently experienced observations, to inform its internal 
representation of the environment. The more past data that can 
be validly used to create a representation of the environment, 
the more accurate the representation is likely to be. However,
"validly" is the key word because in a changing environment, the
challenge is to decide exactly which data should be used to
create an up-to-date representation, and which data are no longer
relevant (Doya, 2002; Behrens et al., 2007).

To illustrate the point: in a stationary environment (an envi- 
ronment which does not change over time), all data from the past, 
no matter how old, could be used to inform an internal repre- 
sentation of the current state of the environment. Therefore, for 
example, in a stationary environment, the mean of all observa- 
tions would give the most accurate estimate possible of the mean 
of the underlying distribution (the environment) from which 
future observations will be drawn. 
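
As a minimal sketch of the stationary case, the mean of all observations can be computed incrementally as an error-driven update whose learning rate shrinks as 1/n, so every past observation keeps an equal weight. The function name `running_mean` and the example data are illustrative assumptions, not part of the original text.

```python
def running_mean(observations):
    """Incrementally compute the mean of all observations seen so far.

    Each update moves the estimate towards the new observation by a
    learning rate of 1/n, which weights all past observations equally.
    """
    estimate = 0.0
    for n, x in enumerate(observations, start=1):
        learning_rate = 1.0 / n
        estimate += learning_rate * (x - estimate)
    return estimate


# Hypothetical observations from a stationary environment.
stationary_estimate = running_mean([1.0, 2.0, 3.0, 4.0])
```

Because the learning rate falls towards zero as n grows, this estimator becomes progressively less responsive to new data, which is appropriate only if the environment never changes.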

In contrast, in a changing (non-stationary) environment, it is 
not true that the distribution of all past observations reflects the 
underlying distribution in force at any particular time point i. On 
the contrary, in a changing environment there is a need for an 
additional layer of processing to work out how observations from 
different times in the past predict future states of the environ- 
ment. For example, if the environment has undergone an abrupt 
change, the best solution may be to identify the change point and 
use all data since that point, disregarding data from prior to the 
change point. There is a trade-off between using as much data 
as possible (to increase the accuracy of the representation) and 
leaving out old data, which may be irrelevant or misleading. 

A SIMPLE WAY TO DISCOUNT OLDER DATA: DECAY KERNELS 

Firstly, to illustrate the problems associated with adjusting to
the rate of change of the environment, we will consider a simple
but non-adaptive strategy for discounting old data: namely to
discount or down-weight older observations. For example, an
estimate of the mean of the underlying distribution at time point
i could be based on a running average of the last n observations
(i - n : i), or a kernel-based average where observations
(i - n : i) are averaged using a weighting function which
down-weights older observations (see Figure 1, left hand panels).
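
The kernel-based average can be sketched as follows, assuming the exponential weighting given in the Figure 1 caption (weight exp(-j/n) for an observation that is j trials old). The function name `kernel_mean` and the toy data are assumptions made for illustration.

```python
import math


def kernel_mean(observations, n):
    """Estimate the current mean with a fixed exponential kernel.

    An observation that is j trials old receives weight exp(-j / n),
    so larger n means a longer stretch of the past is used.
    """
    latest = len(observations) - 1
    total_weight = 0.0
    weighted_sum = 0.0
    for index, x in enumerate(observations):
        age = latest - index        # j = 0 for the most recent observation
        w = math.exp(-age / n)      # older observations get smaller weights
        total_weight += w
        weighted_sum += w * x
    return weighted_sum / total_weight


# Hypothetical noise-free data: the mean jumps from 0 to 10 five trials ago.
data = [0.0] * 20 + [10.0] * 5
short_memory = kernel_mean(data, n=1)    # dominated by recent observations
long_memory = kernel_mean(data, n=20)    # still pulled towards the old mean
```

With these toy data the short kernel has almost fully adopted the new mean, whereas the long kernel still sits much closer to the old one; with noisy data the short kernel would pay for this responsiveness with a more variable estimate.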

This simple, fixed kernel approach is easy to implement in 
data analysis, and one can imagine how it could be implemented 
simply in a neural network: Incoming observations each activate 
a set of neural nodes which represent them (for example, in a 
spatial map, nodes with spatial receptive fields in which stim- 
uli appear would be activated by these stimuli); activation in the 
nodes decays gradually over time so more recently activated nodes 
contribute more to the total activity within the system, as in a 
"leaky accumulator" model (Usher and McClelland, 2001). This 
can be achieved using a single-layer neural network (Bogacz et al., 
2006). 

However, algorithms like the kernel-based approach just
described, which have a fixed rate of discounting old data rather
than adjusting their parameters dynamically to account for periods
of faster and slower change, perform poorly in environments in
which the relevance of old data does not decay as a simple
function of time (Figure 1). If the environment has periods of more-
and less-rapid change, the ideal solution is to adjust the range of
data that are used to inform the model over time, in accordance
with how far into the past data are still relevant.

As an extreme example, consider an environment that has
periods of stationarity interspersed with sudden changes (as in
Figure 1). An algorithm that discounts older observations based
solely on their age, like the simple fixed kernels described above,
applies the same down-weighting to a past observation i - n
regardless of whether a change has occurred since that
observation, or not. If in fact a change has occurred since i - n, then
the best solution would be to treat observations from before the
change differently from those made since the change. On the
other hand, during periods of stability, the best solution would
be to use as many old observations as possible, not to arbitrarily
disregard observations on the basis of age.

To implement a solution in which the range of data adjusts 
to changes in the rate of change of the environment over time, 
a learning algorithm would need some mechanisms by which to 
evaluate the rate of change of the environment. How can this be 
achieved? 

ESTIMATING THE PROBABILITY OF CHANGE 

Consider a clear case in which not all past data are equally 
relevant — an environment which undergoes abrupt changes, 
interspersed with periods of stationarity (periods without 
change) as in Figure 1. How can a learning algorithm effectively
disregard observations from before an abrupt change, whilst using
as much data as possible during stable periods? To do this, the
learning algorithm needs to be able to infer the rate of change of
the environment from the data it observes (Courville et al., 2006;
Behrens et al., 2007; Wilson et al., 2010; Wilson and Niv, 2011).

In order to determine the rate of change of the environment, 
a learning algorithm needs to balance two considerations. Firstly, 
how unlikely was it that current observations were drawn from 
the same distribution (the same state of the environment) as 
previous observations? Secondly, how likely are change points 
themselves? — If I thought change points occurred on average 
about every 10,000 trials, I would need more evidence to infer a 
change than if I thought change points occurred on average every 
10 trials (Wilson et al., 2010). We will now consider how these two 
considerations can be formalized. 

INFERRING CHANGE I: THE LIKELIHOOD FUNCTION 

Let's start with the first of our two considerations: How unlikely
was it that a given observation was drawn from the same
distribution as previous observations? Consider a very simple
learning task in which on each trial i, a target appears at some
location across space, $x_i$. The location is drawn from a Gaussian
distribution with mean $\mu$ and variance $\sigma^2$, such that
$x_i \sim \mathcal{N}(\mu, \sigma^2)$.

Now let's say we observe a data point $x_i$, and we want to know
from what distribution this data point was drawn. In particular,
we want to know whether this data point $x_i$ was drawn from the
same distribution as previous data points, or whether a change in
the environment has occurred, such that the current parameters
FIGURE 1 | Algorithms with a fixed temporal discount do not fit well
to environments with a variable rate of change. The right-hand panels
illustrate an environment in which observations are drawn from a
Gaussian distribution; each row shows a different learning algorithm's
estimate of the distribution mean $\mu$. The mean $\mu$, which has periods of
stability interspersed with sudden change, is shown in black. Actual
observations x are shown in gray. Estimates of $\mu$ are shown in blue.
The top three rows are kernel-based learning algorithms with different
time constants. The left hand panels illustrate the three weighting
functions (kernels) which were used to determine the weighting of
observations in the panels next to them. The weighting $w(j)$ assigned to
observation $i - j$ when calculating the mean on observation $i$ is
defined by the exponential function $w(j) = \exp(-j/n)$. The rate of decay is
determined by the constant n, with higher values of n meaning a longer
period of the past is used. The top row shows a kernel using only very
recent observations. This tracks the mean $\mu$ well, but jumps around a lot
with individual observations. Note the blue line tracks the gray (data) line
more closely than it tracks the actual mean $\mu$ (black line). The 2nd and
3rd rows show kernels using longer periods of the past. This gives a
much smoother estimate, but is slow to adjust to changes in $\mu$. The
bottom row shows the output of a Bayesian learning algorithm that
includes an additional level of processing in order to detect change
points. Note how, unlike the kernel-based algorithms, its estimate is
stable during periods of stability and changes rapidly in response to
change in the underlying distribution.



$\mu_i, \sigma_i^2$ are not equal to previous parameters from some putative
pre-change point, $\mu_{i-n}, \sigma_{i-n}^2$.

Statisticians would talk about this problem in terms of
probability and likelihood. We can calculate the probability that a certain
observation (value of $x_i$) would occur, given some generative
distribution $x_i \sim \mathcal{N}(\mu, \sigma^2)$, where the values of the parameters
$\mu, \sigma^2$ are specified (for example, the probability of observing a value of
$x_i > 3$ given that $\mu = 0$ and $\sigma^2 = 1$ is obtained from the
standard probability density function for the Normal distribution,
as p = 0.001). Conversely, we can think about the likelihood that
the underlying distribution has certain parameters (the likelihood
that $\mu, \sigma^2$ take certain values), given that we have observed a
certain value of $x_i$. The likelihood of some values of $\mu, \sigma^2$ given
observations x can be written as $p(\mu, \sigma^2 \mid x_i)$; conversely the
probability of some observation x given certain parameters of the
environment $\mu, \sigma^2$ can be written $p(x_i \mid \mu, \sigma^2)$. The two quantities
are closely related:



$$p(\mu, \sigma^2 \mid x_i) = p(x_i \mid \mu, \sigma^2) \qquad (1)$$



This relationship gives us a clear way to evaluate whether a change
point has occurred: given some hypothesis about the parameters
of the environment $\mu, \sigma^2$ that were in force prior to a putative
change point, we can calculate the probability that an
observation or set of observations made after the putative change point
would have been observed given the pre-putative-change
parameters of the environment, and hence calculate the likelihood that
the pre-change parameters are in fact still in force (or conversely,
the likelihood a change point has occurred).
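
To make the calculation concrete, here is a minimal sketch using only the Python standard library. The helper names `gaussian_pdf` and `upper_tail` and the example observation values are assumptions for illustration; the first calculation reproduces the worked example above of observing a value greater than 3 when the mean is 0 and the variance is 1.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Probability density of x under a Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def upper_tail(x, mu, sigma2):
    """P(X > x) for X ~ N(mu, sigma2), via the complementary error function."""
    z = (x - mu) / math.sqrt(2.0 * sigma2)
    return 0.5 * math.erfc(z)


# Probability of an observation above 3 when mu = 0 and sigma2 = 1 (roughly 0.001).
p_tail = upper_tail(3.0, 0.0, 1.0)

# Likelihood of the putative pre-change parameters (mu = 0, sigma2 = 1)
# given a typical observation and a surprising one.
like_typical = gaussian_pdf(0.5, 0.0, 1.0)
like_surprising = gaussian_pdf(3.0, 0.0, 1.0)
```

The surprising observation assigns a much lower likelihood to the pre-change parameters, which is the raw evidence a change detector works with.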

It is worth noting that the likelihood function $p(\mu, \sigma^2 \mid x_i)$, or
more generally p(parameters | observations), can only be obtained
in this way if the shape of the distribution from which
observations are drawn is specified: we cannot estimate the parameters
of a distribution if we do not know how that distribution is
parameterized. The validity of pre-specifying the form of the
generative distribution has been debated extensively throughout the
twentieth century (McGrayne, 2011) and we will not rehash that
debate here. We will simply note that whilst a wrong choice of
distribution could lead to incorrect inferences, in practice it is
often possible to make an informed guess about the distribution
from which data are drawn, partly by applying prior experience
with similar systems, and partly because types of observations
follow certain distributions; for example, binary events can often be
modeled using a binomial distribution.
INFERRING CHANGE II: PRIOR PROBABILITY OF CHANGE AND THE 
TRANSITION FUNCTION 

Now let's address the second consideration for algorithms that 
adapt to the rate of change of the environment: the question 
of how likely change points themselves are, and the probabil- 
ity a-priori of particular transitions in the parameters of the 
environment. 

We have already noted that, intuitively, an observer who 
believes change is improbable a-priori (for example, if the 
observer thinks that a change occurs only every 10,000 obser- 
vations) should demand a higher level of evidence in order to 
conclude that a change has occurred, compared to an observer 
who believes change is frequent in his environment (e.g., if the 
observer thinks the environment changes about once every 10 tri- 
als). Furthermore, different environments can change in different 
ways over time — for example, in some environments the param- 
eters might change smoothly, whilst other environments might 
change abruptly. 

A function that models how the state of the environment
evolves over time is called the transition function (Courville et al.,
2006). A transition function defines how the state of the
environment on trial i depends on its state on previous trials. In the
Gaussian example, the transition function specifies how the true
parameters of the environment on trial i, that is $\mu_i, \sigma_i^2$, depend on
the true parameters of the environment on previous trials, $\mu_{i-1}, \sigma_{i-1}^2$.

Different transition functions represent different models of
how the environment changes over time. For example, we could
specify that the parameters of the environment vary smoothly
over time, such that $\mu_i = \mu_{i-1} + \delta\mu$ where $\delta\mu$ is small.
Alternatively, we could allow the parameters of the
environment to jump to totally new values after a change point, for
example by specifying:



$$\mu_i = \begin{cases} \mu_{i-1} & \text{if } J = 0 \\ \text{random} & \text{if } J = 1 \end{cases} \qquad (2)$$

. . . where J is a binary variable determining the probability of a
change, e.g., J follows a binomial B(0.1, 1), giving a probability of
0.1 of a change on any given observation.
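
A jump-type transition function of this kind can be simulated in a few lines. This is a sketch under stated assumptions: the "random" new value is drawn uniformly from a bounded range (the bounds `low` and `high` are hypothetical), and the function name `simulate_jumping_mean` is illustrative.

```python
import random


def simulate_jumping_mean(n_trials, jump_prob, low, high, seed=0):
    """Simulate a mean that usually stays put but occasionally jumps.

    On each trial a change occurs with probability jump_prob; after a
    change the mean takes a new value drawn uniformly from [low, high]
    (an assumption about what "random" means here).
    """
    rng = random.Random(seed)
    mu = rng.uniform(low, high)
    means, jumps = [], []
    for _ in range(n_trials):
        jumped = rng.random() < jump_prob
        if jumped:
            mu = rng.uniform(low, high)   # new value, unrelated to the old one
        means.append(mu)
        jumps.append(jumped)
    return means, jumps


means, jumps = simulate_jumping_mean(n_trials=200, jump_prob=0.1, low=-10.0, high=10.0)
```

Lowering `jump_prob` produces longer stable periods, which corresponds to an observer who should demand more evidence before inferring a change.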

Both the form of the transition function (e.g., smooth change 
vs. jumps) and its parameters (e.g., the probability of a jump 
or the rate of smooth transition) are used to evaluate whether a 
change in the environment has occurred — models with transition 
functions specifying faster rates of change or higher probabil- 
ities of jumps in the parameters of the environment should 
infer change more readily than models that have low a-priori 
expectations of change. 

BAYES' THEOREM AND CHANGE DETECTION 

We have seen that for a learning algorithm to adapt to the rate of
change in the environment, it needs to evaluate both the likelihood of
different states or parameters of the environment given the data, and the
probability of change points themselves. These two elements are
captured elegantly in Bayes' rule, which in this case can be written:

$$p(\theta_i \mid x_{1:i}) \propto p(x_i \mid \theta_i)\, p(\theta_i) \qquad (3)$$

. . . where $\theta_i$ represents the parameters of the environment
on the current trial i ($\mu_i, \sigma_i^2$ in our Gaussian example), and
$x_{1:i}$ are the observations on all trials up to and including the
present one.

On the right hand side, $p(x_i \mid \theta_i)$ is equal to $p(\theta_i \mid x_i)$, the
likelihood function, due to Equation 1 above; $p(\theta_i)$, the prior
probability of the parameters $\theta_i$, can be thought of as $p(\theta_i \mid x_{1:i-1})$ and is
obtained from the estimate of the parameters of the environment
on trial i - 1 via the transition function. For example if we model
a transition function as in Equation 2, so that the parameters of
the environment mostly stay the same from one trial to the next
but can jump to totally new values with some probability q, then

$$p(\theta_i) = (1 - q)\, p(\theta_i \mid x_{1:i-1}) + q\, U(\theta) \qquad (4)$$

. . . where $p(\theta_i \mid x_{1:i-1})$ is the probability that the parameters $\theta_i$
took some values given all previous observations $x_{1:i-1}$, and $U(\theta)$
is a uniform probability distribution over all possible new values
of $\theta_i$ if there had been a change point.
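
The update described by Equations 3 and 4 can be sketched numerically on a grid of candidate parameter values. This is a simplified illustration rather than the model used in the figures: it assumes the variance is known and fixed so that only the mean is inferred, and the grid range, the change probability `q`, and the observations are all hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Probability density of x under a Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def update(belief, grid, x, q, sigma2):
    """One trial of Bayesian updating over a grid of candidate means."""
    uniform = 1.0 / len(grid)
    # Equation 4: mix the previous posterior with a uniform "after a jump" prior.
    prior = [(1.0 - q) * b + q * uniform for b in belief]
    # Equation 3: multiply by the likelihood of the new observation, then normalise.
    unnorm = [gaussian_pdf(x, mu, sigma2) * p for mu, p in zip(grid, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def posterior_mean(belief, grid):
    """Expected value of the mean under the current belief."""
    return sum(mu * b for mu, b in zip(grid, belief))


grid = [m / 10.0 for m in range(-100, 101)]   # candidate means from -10 to 10
belief = [1.0 / len(grid)] * len(grid)        # flat starting belief

# Hypothetical observations: stable around 0, then an abrupt change to around 5.
observations = [0.2, -0.1, 0.3, 0.0, -0.2, 0.1, 5.1, 4.8, 5.2, 5.0]
estimates = []
for x in observations:
    belief = update(belief, grid, x, q=0.1, sigma2=0.25)
    estimates.append(posterior_mean(belief, grid))
```

In this sketch the estimate stays near 0 during the stable period and moves to the new mean almost as soon as the surprising observations arrive, because the uniform component of the prior keeps some probability on values far from the old mean.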

Bayes' rule expresses a general concept about how an observer's 
beliefs should be updated in light of new observations (for exam- 
ple, whether observations indicate a change in the underlying 
environment); it expresses the idea that the degree to which the 
observer should change his beliefs depends on both the likeli- 
hood that previously established parameters are still in force, and 
the transition function or change-point probability. Hence Bayes' 
rule captures the two considerations we have argued are impor- 
tant for algorithms that respond adaptably to the rate of change 
of the environment. 

Because these considerations relate so closely to Bayes' the- 
orem, it could be argued that any change-detection model that 
considers the likelihood that old parameters are still in force, and 
the prior probability of different parameter values (for example 
based on a transition function) is Bayesian in nature. 

UNCERTAINTY AND LEARNING 

In this review we are interested in how learning algorithms adapt 
to change. A key concept in relation to learning and change 
is uncertainty. There is a natural relationship between uncer- 
tainty and learning in that it is generally true that the purpose 
of learning is to reduce uncertainty, and conversely, the level of 
uncertainty about the environment determines how much can be 
learned (Pearce and Hall, 1980; Dayan and Long, 1998; Dayan 
et al., 2000). We will now see that two types of uncertainty, 
expected uncertainty and estimation uncertainty, which can be 
loosely related to the concepts of likelihood and transition func- 
tion just discussed, play different roles in learning and may have 
distinct neural representations. 

TYPES OF UNCERTAINTY 

Uncertainty can be divided into two constructs: risk or expected
uncertainty, and ambiguity or estimation uncertainty (Knight,
1921; Dayan and Long, 1998; Courville et al., 2006; Preuschoff
and Bossaerts, 2007; Payzan-Lenestour and Bossaerts, 2011).

Risk or expected uncertainty refers to the uncertainty which 
arises from the stochasticity inherent in the environment — 
for example, even if an observer knew with certainty that 
FIGURE 2 | Relationship between the concepts of Expected
Uncertainty and Likelihood. Plot of values of some observed variable x
against their probability, given two Gaussian distributions with the same
mean. The red distribution has a lower variance, and hence lower expected
uncertainty, than the blue distribution. Points a and b represent possible
observed values of x. For the red and blue distributions, the distance from
the mean (a - $\mu$) is the same, but at a, the red distribution has higher
likelihood (because point a has a higher probability under the red
distribution than the blue distribution) whilst at point b, the blue distribution
has a higher likelihood. Consider an algorithm assessing evidence that the
environment has changed. If a datapoint x = b is observed, whether the
algorithm infers that there has been a change will depend on the variance
or expected uncertainty of the putative pre-change distribution. If the
algorithm "thinks" that the red distribution is in force, an observation x = b
is relatively strong evidence for a change in the environment (as b is unlikely
under the red distribution) but if the algorithm "thinks" the blue distribution
is in force, the evidence for change is much weaker, since point b is not so
unlikely under the blue distribution as it is under the red distribution.



observations were drawn from some Gaussian distribution
$x \sim \mathcal{N}(\mu, \sigma^2)$, with known parameter values (known values $\mu, \sigma^2$),
he would still not be able to predict with certainty the value
of the next observation, because observations are drawn
stochastically from a (known) distribution with some variance,
$\sigma^2$. Thus, $\sigma^2$ determines the level of expected uncertainty in this
environment.

In contrast, uncertainty that arises from the observer's
incomplete knowledge of the environment (in our Gaussian
example, uncertainty about the values of $\mu, \sigma^2$ themselves) is called
estimation uncertainty or ambiguity (Knight, 1921). Estimation
uncertainty is the type of uncertainty that may be reduced by
obtaining information, e.g., by increasing the number of
observations of the environment. Estimation uncertainty generally
increases when the environment is thought to have changed to
a new state (since relatively few observations of the new state are
available).
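
The way estimation uncertainty shrinks with more observations can be illustrated with a deliberately simplified case: if the variance is assumed known and the prior over the mean is flat, the posterior over the mean is Gaussian, centered on the sample mean, with standard deviation equal to the square root of (variance / n). The function name and the numbers below are illustrative assumptions.

```python
import math


def posterior_of_mean(observations, sigma2):
    """Posterior over the mean for a Gaussian with known variance sigma2.

    Assumes a flat prior over the mean, so the posterior is Gaussian with
    mean equal to the sample mean and standard deviation sqrt(sigma2 / n).
    """
    n = len(observations)
    post_mean = sum(observations) / n
    post_sd = math.sqrt(sigma2 / n)
    return post_mean, post_sd


# Expected uncertainty (sigma2) is the same in both cases; only the
# number of observations, and hence estimation uncertainty, differs.
_, sd_few = posterior_of_mean([0.4, -0.6, 1.1, -0.9], sigma2=1.0)
_, sd_many = posterior_of_mean([0.0] * 100, sigma2=1.0)
```

Here the spread of belief about the mean falls from 0.5 with four observations to 0.1 with one hundred, while the variability of individual observations is unchanged.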

Expected uncertainty and estimation uncertainty relate to the 
two factors we previously discussed in relation to change detec- 
tion: the likelihood that the same state of the environment is 
in force now as previously, and the a-priori probability that the 
state of the environment is not what the observer had previously 
thought (determined in part by the transition function). 

Expected uncertainty affects inferences about the likelihood
that the same state of the environment is in force now as
previously, because given some observation $x_i$, the strength of evidence
for a change in the environment depends not only on how far $x_i$
falls from the expected value E(x) but also on the estimated
variance of the distribution from which x is drawn. For example, in
our Gaussian learning model, for some putative $\mu$, the probability
of an observation $x_i$, and hence the likelihood that the model
parameters $\mu, \sigma^2$ take given values, depends both on the distance of
the observation from the putative model mean, $x_i - \mu$, and on
the level of expected uncertainty within the environment, $\sigma^2$: if
expected uncertainty ($\sigma^2$) is low, then a given value of ($x_i - \mu$)
represents stronger evidence against $\mu, \sigma^2$ still being in force,
compared to if expected uncertainty ($\sigma^2$) was high. This concept
is illustrated in Figure 2.
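
A small numerical sketch of this point: the same deviation from the putative mean is compared against a uniform "changed" alternative, in the spirit of the uniform term in Equation 4. The width of the uniform range, the variances, and the function names are assumptions chosen for illustration.

```python
import math


def log_gaussian_pdf(x, mu, sigma2):
    """Log probability density of x under a Gaussian N(mu, sigma2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)


def log_evidence_for_change(x, mu, sigma2, width):
    """Log ratio of the density under a uniform 'changed' alternative
    (spread over an assumed range of the given width) to the density
    under the putative pre-change Gaussian. Positive values favour change."""
    return math.log(1.0 / width) - log_gaussian_pdf(x, mu, sigma2)


width = 40.0   # hypothetical range of possible post-change values
# The same deviation (x - mu = 3) under low and high expected uncertainty.
evidence_low_var = log_evidence_for_change(3.0, 0.0, 1.0, width)
evidence_high_var = log_evidence_for_change(3.0, 0.0, 9.0, width)
```

With these numbers the deviation favours a change when the variance is 1 but not when the variance is 9, even though the distance from the mean is identical.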

Estimation uncertainty, in contrast, relates more closely to the
idea of assessing the a-priori probability of change in the
environment. Firstly, the strength of belief in any particular past state
of the environment affects estimation uncertainty: intuitively, if
the observer is not sure about the state of the environment, he
may be more willing to adjust his beliefs. Secondly, beliefs about
the rate or frequency of change in the environment (i.e., about the
transition function) affect estimation uncertainty because if the
observer believes the rate of change of the environment to be high,
then the extrapolation of past beliefs to predictions about the
future state of the environment is more uncertain. These concepts
are illustrated in Figures 3, 4.

In order to illustrate how the effects of expected and estimation
uncertainty on change point detection translate into an
influence on learning rate, we can consider a model which observes a
series of data points from a Gaussian distribution and uses these
sequentially to infer the parameters of that distribution, whilst
taking into account the possibility that those parameters have
jumped to new values, as in Equation 2. Details of this model
are given in the Appendix and its "behaviour" is illustrated in
Figure 5.

In Figure 2 we saw that when expected uncertainty is high, the
deviation of an observed value or set of values from the
distribution mean needs to be higher, to offer the same weight of evidence
for a change in the underlying model parameters, compared to
when expected uncertainty is low. In the case of our Gaussian
target locations example, this would mean that when $\sigma^2$ is believed
to be high, a given deviation of a sample from the mean ($x - \mu$) is
weaker evidence for change, compared to when the estimate of $\sigma^2$
is low. In terms of a learning algorithm, this is illustrated in
panels (A) and (B) of Figure 6. Panel (A) shows a case where the true
mean of the generative distribution changes when $\sigma^2$ is thought
to be high (so expected uncertainty is high). Panel (B) shows a
change of similar magnitude in the generative mean, when $\sigma^2$ is
thought to be low. The model adapts much more quickly to the
change in the distribution mean in the case with lower expected
uncertainty.

In contrast, we have argued that the level of estimation
uncertainty or ambiguity is more closely related to the second
consideration, the probability of change itself. Consider the process
by which probability densities over the model parameters are
updated in our Bayesian learning model. A-priori (before a
certain data point $x_i$ is observed), if the probability of change is
believed to be high, estimation uncertainty over the parameters
FIGURE 3 | Illustration of estimation uncertainty. These plots show the
output of a numerical Bayesian estimation of the parameters of a Gaussian
distribution. If $x \sim \mathcal{N}(\mu, \sigma^2)$, and some values of x are observed, the
likelihood of different values for $\mu, \sigma^2$ can be calculated jointly using Bayes'
rule. The colored plots (left) show the joint likelihood for different pairs of
values $\mu, \sigma^2$, where each point on the colored image is a possible pair of
values $\mu, \sigma^2$, and the color represents the likelihood of that pair of values.
The line plots (right panel) show the distribution across x implied by
different values of $\mu, \sigma^2$. The dashed black line is the true distribution from
which data were drawn. The blue line is the maximum a-posteriori
distribution, a Gaussian distribution with values of $\mu, \sigma^2$ taken from the
peak of the joint distribution over $\mu, \sigma^2$ shown on the left. The red line
represents a weighted sum (W.S.) of the Gaussian distributions
represented by all possible values of $\mu, \sigma^2$, weighted by their joint likelihood
as shown in the figure to the left. The top row represents an estimate of the
environment based on fewer data points than the bottom row. With
relatively few data points, there is a lot of uncertainty about the values of
$\mu, \sigma^2$, i.e., estimation uncertainty, illustrated by the broader distribution of
likelihood over different possible values of $\mu, \sigma^2$ (left panel) in the top than
bottom row. Whilst the maximum a-posteriori distribution is a good fit to
the "true" distribution from which data were drawn in both cases, if we
look at the weighted sum of all distributions, there is a lot more uncertainty
for the top row case, based on fewer data points. Hence if the observer
uses a weighted sum of all possible values of $\mu, \sigma^2$ of the environment to
calculate a probability distribution over x, the variance of that distribution
depends on the level of estimation uncertainty.



[I and 0^ is also high — this is the effect illustrated in Figure 6. 
Conversely, a-posteriori (after a data point or data points are 
observed), estimation uncertainty is increased if evidence for a 
change-point is observed (i.e., a data point or set of data points 
which are relatively unlikely given the putative current state of 
the environment), (Dayan and Long, 1998; Courville et al., 2006). 
We can see this in Figure 7. As the model starts to suspect that 
the parameters of the environment have changed, the spread 
of probability density across parameter space (i.e., estimation 
uncertainty) increases. As more data are observed from the new 
distribution, the estimate of the new parameters of the envi- 
ronment improves, and estimation uncertainty decreases. Hence 
estimation uncertainty is related both to the a-priori expectation 
of change, and to the a-posteriori probability that a change may 
have occurred. 
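
The two ingredients named above, the prior expectation of change and the evidence carried by a surprising observation, can be combined with Bayes' rule. The following is a minimal Python sketch rather than the model used for the figures. It assumes that the current state of the environment is summarized by a single known Gaussian, and that after a change observations are uniformly distributed over a fixed range (the logic of Figure 4). The function names, the range and the numerical values are hypothetical, chosen only for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def change_probability(x, mu, sigma, q, x_min, x_max):
    """Posterior probability that a change occurred, given one observation x.

    Assumption (illustrative): if a change occurred, x is uniformly
    distributed on [x_min, x_max]; q is the prior probability of change.
    """
    p_change = q / (x_max - x_min)                     # prior of change * uniform density
    p_stay = (1.0 - q) * gaussian_pdf(x, mu, sigma)    # prior of no change * Gaussian density
    return p_change / (p_change + p_stay)


# Hypothetical values: current model is N(0, 1), observations lie in [-10, 10].
for q in (0.05, 0.3):
    near = change_probability(0.5, mu=0.0, sigma=1.0, q=q, x_min=-10.0, x_max=10.0)
    far = change_probability(4.0, mu=0.0, sigma=1.0, q=q, x_min=-10.0, x_max=10.0)
    print(f"q={q}: P(change | x=0.5)={near:.3f}, P(change | x=4.0)={far:.3f}")
```

In this sketch an observation far from the current mean yields a high posterior probability of change, and raising the prior q raises that probability for every observation.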

The role of estimation uncertainty in determining how much 
can be learned can be related to concepts in both Bayesian 
theory (Behrens et al., 2007) and classic associative learning 






FIGURE 4 | Two considerations for evaluating whether a change has 
occurred. Plots show the probability of observing some value of x, given 
that x ~ N(μ, σ²) and that the values of μ, σ² can jump to new, unpredicted 
values as defined in Equation 2. When an observation of the environment is 
made, an algorithm that aims to determine whether a change has occurred 
should consider both the likelihood of the previous model of the 
environment given the new data, and the prior probability of change as 
determined in part by the transition function. Top panel: the probability of 
an observation taking a value x is shown in terms of two distributions. A 
Gaussian shown in blue represents the probability density across x if the 
most likely state of the environment (the most likely values of μ, σ²), given 
past data, were still in force. The uniform distribution in red represents the 
probability density across x arising from all the possible new states of the 
environment, if a change occurred. The possible new states are 
represented by a uniform function (red line in the figure) because, if we 
consider the probability of each value of x under an infinite number of 
possible states at once (i.e., the value of x given each of infinitely many 
other possible values of μ and σ²), the outcome is a uniform distribution 
over x. A change should be inferred if an observation occurs in the gray 
shaded regions, where the probability of x under the uniform 
(representing change) is higher than the probability under the prior 
Gaussian distribution. Hence the red data point in Figure 4 should cause 
the system to infer a change has occurred, whereas the blue data point 
should not. Bottom panel: as above, the probability distribution over x is a 
combination of a Gaussian and a Uniform distribution (representing the 
most likely parameters of the environment if there has been no change, 
and the possible new states of the environment if there has been a change, 
respectively). In this panel, the Gaussian and Uniform components are 
summed to give a single line representing the distribution over x. The 
different colored lines represent different prior probabilities of change, and 
hence different relative weightings of the Gaussian and uniform 
components. Increasing the prior probability of change results in a wider 
distribution of probability density across all possible values of x. 



theory (Pearce and Hall, 1980): in the terminology of classical 
conditioning, estimation uncertainty can be equated with associability 
(Dayan and Long, 1998; Dayan et al., 2000), associability 
being a term in formal learning theory which defines how much 
FIGURE 5 | Bayesian learner estimates the mean and variance of a 
Gaussian distribution. (A) Data and maximum likelihood estimates for 
200 trials. The actual mean and variance of the distribution from which the 
data were drawn (generative distribution) are shown in gray. The gray line is 
the mean and the shaded area is mean ± standard deviation. The model's 
estimates of these parameters are shown superposed on this, in blue. The 
actual data points on which the model was trained are shown as black dots. 
The scale on the y-axis is arbitrary. (B) The probability density function 
across parameter space (for plotting conventions, see Figure 3) for the first 
100 trials. Each parameter-space map represents one trial; trials are shown 
in rows with the first trial number in each row indicated to the left of the 
row. Possible values of μ_i are plotted on the y-axis; possible values of σ_i are 
plotted on the x-axis. Colors indicate the joint posterior probability for each 
pair μ_i, σ_i after observing data point x_i. Increasing values of σ_i 
are plotted from left to right; increasing values of μ_i are plotted from top to 
bottom. Hence, for example on trial 10 (top right) the model thinks μ_i is low, 
and σ_i is high. Some interesting sequences of trials are highlighted in 
Figures 6, 7. 



can be learned about a given stimulus, where the amount that 
can be learned is inversely related to how much is already 
known about the stimulus (Pearce and Hall, 1980). Low estimation 
uncertainty means low associability, which means minimal 
learning. Similarly, estimation uncertainty relates to the learning 
rate α in the Rescorla-Wagner model of reinforcement learning 
(Rescorla and Wagner, 1972; Behrens et al., 2007), because 
higher estimation uncertainty is associated with faster learning. 
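
To make the role of the learning rate concrete, the sketch below applies a single-cue delta rule of the Rescorla-Wagner type, V ← V + α(outcome − V), to a sequence of outcomes containing one step change. This is a simplified illustration, not the full Rescorla-Wagner model with multiple cues; the function name, the outcome sequence and the two values of α are assumptions made for the example.

```python
def rescorla_wagner(outcomes, alpha, v0=0.0):
    """Return the estimate after each outcome under V <- V + alpha * (outcome - V)."""
    v = v0
    trace = []
    for outcome in outcomes:
        v += alpha * (outcome - v)  # prediction error scaled by the learning rate
        trace.append(v)
    return trace


# Hypothetical environment: outcome is 0 for 10 trials, then jumps to 1.
outcomes = [0.0] * 10 + [1.0] * 10
slow = rescorla_wagner(outcomes, alpha=0.1)
fast = rescorla_wagner(outcomes, alpha=0.5)

# Compare the two estimates five trials after the change (index 14).
print(f"alpha=0.1 estimate: {slow[14]:.3f}")
print(f"alpha=0.5 estimate: {fast[14]:.3f}")
```

With the higher learning rate the estimate approaches the new outcome value in fewer trials, which is the sense in which higher estimation uncertainty (and hence a higher effective α) corresponds to faster learning.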

TOP DOWN CONTROL OF ESTIMATION UNCERTAINTY? 

In a stable environment, estimation uncertainty — uncertainty 
about the parameters of the environment — generally decreases 
over time, as more and more observations are made that are 
consistent with a particular state of the environment. Indeed it has 
been argued that the main goal of a self-organizing system like 
the brain is to reduce surprise by improving the match between its 
internal representations of the environment and the environment 
FIGURE 6 | Learning is faster when expected uncertainty is low. Panels 
(A) and (B) show two sets of trials which include changes of similar 
magnitude in the mean of the generative distribution (distribution from 
which data were in fact drawn). In panel (A), the estimate of σ_i is high 
(high expected uncertainty) but in panel (B), the estimate of σ_i is 
lower; this is indicated by the distribution of probability density from left 
to right in the colored parameter-space maps, and also the width of the 
shaded area μ ± σ on the lower plot. The red boxes indicate the set of 
trials shown in the parameter space maps; the red arrow shows which 
parameter space map corresponds to the first trial after the change point. 
Note that the distribution of probability in parameter space changes more 
slowly when expected uncertainty is high (panel A), indicating that learning 
is slower in this case. 
FIGURE 7 | Change in the environment increases estimation 
uncertainty. Here we see a set of trials during which a change point occurs 
(change point indicated by red arrow). Before the change point, the model 
has low estimation uncertainty (probability density is very concentrated in a 
small part of parameter space, as seen from the first three parameter 
space maps). When the change point is detected, estimation uncertainty 
increases as the model initially has only one data point on which to base its 
estimate of the new parameters of the distribution. Over the next few 
trials, estimation uncertainty decreases (probability density becomes 
concentrated in a smaller part of parameter space again). 



itself (Friston and Kiebel, 2009; Friston, 2010), i.e., to reduce 
estimation uncertainty as well as estimation error. 

Whilst additional observations of the environment tend to 
decrease estimation uncertainty, estimation uncertainty is driven 
up by observations that suggest a change may have occurred in 
the environment: surprising stimuli are associated with increases 
in the learning rate (Courville et al., 2006). We might think of 
this as bottom-up or data-driven control of the level of estimation 
uncertainty in the model, or equivalently the learning rate, or the 
prior expectation of change. 

However, it is also possible to imagine situations in which it 
might be advantageous to control estimation uncertainty (or the 
learning rate) top down instead of bottom up, i.e., to actively 
increase the learning rate in order to "make space" for new 
information about the environment. One such situation would 
be when an observer is actively exploring his environment and 
hence presumably wishes to adapt his internal model of the envi- 
ronment to take into account the new information obtained by 
exploring. Indeed, change of context (moving an animal from one 
location to another) is associated with increased learning rate in 
experimental animals (Lovibond et al., 1984; Hall and Channell, 
1985; McLaren et al., 1994). 

NEURAL REPRESENTATIONS OF ESTIMATION UNCERTAINTY 
AND LEARNING RATE 

A common set of neural phenomena are associated with the rate 
of learning, processing of stimuli that could indicate a change 
in the environment, and active exploration of the environment; 
these phenomena could be conceptualized computationally in 
terms of control of the level of estimation uncertainty in the 
brain's models of the environment. 

Neuroanatomically, an area of particular interest in relation 
to estimation uncertainty is the anterior cingulate cortex (ACC). 
Activity in the ACC has been shown to correlate with learning rate, 
such that the ACC is more active in environments that change 
frequently and in which observers learn quickly about change (i.e., 
conditions of high estimation uncertainty) 
(Behrens et al., 2007). The ACC is also activated when people 
receive feedback about their actions or beliefs that causes them 
to modify their behavior on future trials (and by implication, to 
modify their internal model of the environment) (Debener et al., 
2005; Cohen and Ranganath, 2007; Matsumoto et al., 2007). 
This activity, which has been observed using fMRI and 
electrophysiological recordings, is probably the source of the error- or 
feedback-related negativity (ERN; Debener et al., 2005). 

Interestingly, ACC activity may be more closely related to the 
forgetting of old beliefs about the environment (and hence the 
increasing of estimation uncertainty), than to new learning. In 
a particularly relevant study, Karlsson et al. (2012) showed that in 
rats performing a two-alternative probabilistic learning task, patterns 
of activity in the ACC underwent a major change 
when the probabilities associated with each of the two options 
reversed. Importantly, rats' behavior around a probability reversal 
(when the values associated with each lever switched) had three 
distinct phases — before the reversal, rats showed a clear prefer- 
ence for the high value lever, but when the probabilities reversed 
there was a period in which the rats showed no preference for 
either lever (they probed each lever several times as if working out 
the new values associated with each lever) before settling down 
into a new pattern of behavior that favored the new high value 
lever. The ACC effect was associated with the point at which rats 
abandoned their old beliefs about the environment in favor of 
exploration and the acquisition of new information (and hence, 
should have had raised levels of estimation uncertainty) — rather 
than at the time at which a new model of the environment started 
to govern behavior. 

Further experiments have reported ACC activity when partici- 
pants make the decision to explore their environment rather than 
to exploit known sources of reward (Quilodran et al., 2008), or 
to forage for new reward options rather than choosing between 
those options immediately available to them (Kolling et al., 
2012) — again, these are cases in which estimation uncertainty in 
the brain's internal models could be actively raised, to facilitate the 
acceptance of new information in the new environment (Dayan, 
2012). 

Neurochemically, Dayan and colleagues have proposed that the 
neuromodulator noradrenaline (also called norepinephrine) signals 
estimation uncertainty. Evidence from pupillometry studies 
suggests that noradrenaline levels [which are correlated with pupil 
dilation (Aston-Jones and Cohen, 2005)] are high when estimation 
uncertainty is high in a gambling task (Preuschoff et al., 
2011). Increases in pupil dilation have been demonstrated both in 
circumstances that should drive estimation uncertainty bottom-up 
[when data are observed that suggest a change point has 
occurred (Nassar et al., 2012)], and in circumstances that should 
drive it top-down [during exploratory behavior (Nieuwenhuis et al., 2005)]. 

Pupil diameter is increased in conditions when observers think 
the rate of change in the environment is high, and is phasically 
increased when observers detect a change in the environment 
(Nassar et al., 2012). Hence tonic noradrenaline levels could be 
said to represent the prior probability of change in the environment, 
whilst phasic noradrenaline may represent a-posteriori 
evidence (based on sensory input) that a change is occurring or 
has occurred at a given time point (Bouret and Sara, 2005; Dayan 
and Yu, 2006; Sara, 2009). 

Interestingly, whilst events which are surprising in relation to 
a behaviorally-relevant model of the environment are associated 
with an increase in noradrenaline release [29,30] and pupil diameter 
[31], it has also been shown that irrelevant surprising events 
which cause an increase in pupil diameter also cause an increase 
in learning rate (Nassar et al., 2012), suggesting a rather generalized 
mechanism by which the malleability of neural circuits may 
be affected by surprise, in accordance with behavioral evidence 
that surprising events affect the learning rate (Courville et al., 
2006). 

The mechanism by which noradrenaline represents or controls 
estimation uncertainty is not known, although two appealing 
theoretical models are that noradrenaline acts on neural mod- 
els of the environment by adjusting the gain function of neurons 
(Aston-Jones and Cohen, 2005), or by acting as a "reset" signal 
that replaces old models of the environment with uninformative 
distributions, to make space for new learning (Bouret and Sara, 
2005; Sara, 2009). 

The involvement of the ACC and noradrenaline in the con- 
trol/representation of estimation uncertainty may be linked, 
because the ACC has strong projections to the nucleus that 
produces noradrenaline, the locus coeruleus (Sara and Herve-Minvielle, 
1995; Jodo et al., 1998). 

Whilst there is currently little consensus on the representation 
of learning rate and uncertainty in the brain, the data reviewed 
here do begin to suggest a mechanism by which estimation 
uncertainty and learning rate are controlled neurally, which is 
involved both when uncertainty and learning are driven bottom-up (by 
observations that suggest the environment is changing) and when 
they are driven top-down (such as when agents actively quit a 
familiar environment and explore a novel one). 
APPENDIX 

LEARNING MODEL FOR FIGURES 4-7 

Let data x be drawn from a Gaussian distribution with unknown 
mean μ and variance σ². The values of μ and σ² occasionally 
jump to new values; the probability of such a jump occurring 
between any pair of observations is fixed at some value q. For 
simplicity in this example we assume q is known, but it is also 
possible to infer q from the data (Nassar et al., 2010; Wilson et al., 
2010). 

Then the structure of the environment can be described as 
follows: 

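$$x_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \tag{5}$$

$$(\mu_i, \sigma_i^2) = \begin{cases} (\mu_{i-1}, \sigma_{i-1}^2) & \text{if } J = 0 \\ \text{drawn from } U(\mu_{\min}, \mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max}) & \text{if } J = 1 \end{cases} \tag{6}$$

where J is a binary variable indicating whether a jump occurs, 
such that J follows a Bernoulli distribution with probability q. 

$$J \sim B(q) \tag{7}$$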



Then the values for μ_i and σ_i² can be inferred from the data using 
Bayes' rule as follows: 

$$p(\mu_i, \sigma_i^2 \mid x_{1:i}) \propto p(x_i \mid \mu_i, \sigma_i^2)\, p(\mu_i, \sigma_i^2 \mid x_{1:i-1}) \tag{8}$$

where the likelihood is 

$$\mathcal{L}(\mu_i, \sigma_i^2 \mid x_i) = p(x_i \mid \mu_i, \sigma_i^2) = \mathcal{N}(x_i; \mu_i, \sigma_i^2) \tag{9}$$

and the prior is derived from the posterior on the previous 
trial, incorporating a uniform "leak" over parameter space to 
represent the possibility that the values of the parameters have 
changed since the previous observation: 

$$p(\mu_i, \sigma_i^2 \mid x_{1:i-1}) = (1 - q)\, p(\mu_{i-1}, \sigma_{i-1}^2 \mid x_{1:i-1}) + q\, U(\mu_{\min}, \mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max}) \tag{10}$$

On trial 1, the prior over parameter space is uniform. 
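
The update described by Equations 8 to 10 can be sketched on a discretized parameter grid. The following Python code is an illustrative sketch under stated assumptions, not the implementation used to generate the figures: the grid resolution, the parameter ranges, the value of q, the simulated data and all function names are hypothetical, and the use of posterior entropy as an index of estimation uncertainty is an assumption introduced here for convenience.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var, evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def make_grid(mu_min, mu_max, var_min, var_max, n):
    """Evenly spaced (mu, var) pairs covering the allowed parameter range."""
    mus = [mu_min + (mu_max - mu_min) * k / (n - 1) for k in range(n)]
    variances = [var_min + (var_max - var_min) * k / (n - 1) for k in range(n)]
    return [(m, v) for m in mus for v in variances]


def update(posterior, grid, x, q):
    """One trial: leaky prior (Eq. 10), likelihood (Eq. 9), Bayes' rule (Eq. 8)."""
    uniform = 1.0 / len(grid)
    prior = [(1.0 - q) * p + q * uniform for p in posterior]
    unnorm = [pr * normal_pdf(x, m, v) for pr, (m, v) in zip(prior, grid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def entropy(p):
    """Shannon entropy of the posterior (assumed index of estimation uncertainty)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


random.seed(0)
grid = make_grid(-10.0, 10.0, 0.25, 9.0, 25)   # hypothetical parameter ranges
posterior = [1.0 / len(grid)] * len(grid)      # uniform prior on trial 1

# Hypothetical data: 40 draws from N(-3, 1), then a change point to N(4, 1).
data = [random.gauss(-3.0, 1.0) for _ in range(40)]
data += [random.gauss(4.0, 1.0) for _ in range(40)]

history = []
for x in data:
    posterior = update(posterior, grid, x, q=0.05)
    mean_mu = sum(p * m for p, (m, _) in zip(posterior, grid))
    history.append((mean_mu, entropy(posterior)))

for trial in (39, 40, 79):
    print(f"trial {trial + 1}: E[mu]={history[trial][0]:.2f}, entropy={history[trial][1]:.2f}")
```

Starting from a uniform posterior makes the leaky prior on trial 1 uniform as well, in line with the statement above. In this sketch the spread of the posterior over the grid increases on the first trial after the change point and shrinks again as data from the new distribution accumulate, mirroring the pattern described for Figure 7.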